High-speed and efficient addition of multiple operands is an essential
operation in the design of any computational unit. The speed and power
efficiency of multiplier circuits play a vital role in the overall
performance of microprocessors. Multipliers are crucial in the design of
arithmetic logic units (ALUs) and digital signal processors (DSPs), which are
widely employed for filtering and convolution operations. Multiplying
binary or fixed-point numbers yields a large number of partial products
that must be added to obtain the final product. The number of partial
products and the process of summing them dictate the latency and power
consumption of the multiplier design. Here, we present a novel binary
counter design that uses stacking circuits, which group all logic "1" bits
together, followed by a novel symmetric method that merges pairs of 3-bit
stacks into 6-bit stacks and then converts them to binary counts. This
results in substantial improvements in the power and area utilization of the
multiplier. Additionally, this paper presents a novel approximate compressor
and uses it to design approximate multipliers that can be employed in
electronic systems characterized by power and speed constraints.

1. INTRODUCTION 


The general process of multiplication consists of generating partial products and adding all of them
to obtain the final result. The number of partial products depends on the length of the numbers being
multiplied. Increasing the operand length proportionally increases the number of partial products, and this
in turn increases the complexity of the adders, which dictates circuit performance parameters such as power
and speed. This creates the need for novel adder and multiplier circuits that are efficient with respect to
power and speed. Moreover, such multiplier circuits are needed in the design of arithmetic logic units (ALUs)
and digital signal processors, which predominantly perform filtering and convolution operations. Hence, the
design of power- and speed-efficient adder and multiplier circuits also improves the performance of various
DSP and embedded processors. This section discusses the mechanisms that can be applied to achieve
high-speed multiplication and their technological advancements. Many researchers conclude that speed
improvements in multipliers can be obtained by using hybrid structures, by reducing the number of partial
products, and by employing counters and compressors. Similarly, to achieve power savings, multiplier
architectures employ exclusive-OR (XOR) and exclusive-NOR (XNOR) gates, reduce the number of adders
present in the architecture, and employ counters and compressors. 
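
The partial-product view of multiplication described above can be illustrated with a minimal Python sketch. The function names `partial_products` and `column_heights` are hypothetical and chosen only for this illustration; the sketch assumes unsigned n-bit operands and simple AND-based partial-product generation, not any specific architecture from the cited works.

```python
def partial_products(a, b, n):
    """Return the n shifted partial products of two n-bit unsigned integers."""
    rows = []
    for i in range(n):
        bit = (b >> i) & 1            # i-th bit of the multiplier
        rows.append((a * bit) << i)   # multiplicand ANDed with that bit, weighted by 2**i
    return rows


def column_heights(n):
    """Number of partial-product bits in each of the 2n-1 columns of an n x n array."""
    return [min(k, n - 1, 2 * n - 2 - k) + 1 for k in range(2 * n - 1)]


if __name__ == "__main__":
    rows = partial_products(13, 11, 4)
    print(rows, sum(rows))        # the rows add up to the product 13 * 11
    print(column_heights(4))      # the tallest column has n bits
```

The column heights show why the reduction stage dominates the design: the middle column of an n x n array holds n bits that must be compressed before the final addition.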

The design of a hybrid high-speed carry-select adder using a carry-lookahead adder (CLA) is presented
in [1]. Along similar lines, an analysis of a high-speed radix-4 multiplier using a Shannon adder, suitable for
digital signal processor (DSP) applications, is presented in [2]. A similar hybrid adder using analog and digital
circuits is presented in [3]. Adder designs aimed at power savings are given in [4], [5].
The design of a high-speed multiplier using binary counters based on symmetric stacking, given in [6], [7], aims
to improve speed by targeting the delays along critical paths. The new designs of 7-2 and
5-2 ultra-high-speed compressors [8] suggest that further speed improvements can be achieved by employing
them within the regular structures of high-speed multipliers. Recently, various versions of Wallace tree
structures [9] were implemented using parallel prefix adders such as the Kogge-Stone, Sklansky, Brent-Kung,
Ladner-Fischer and Han-Carlson adders to speed up the multiplication process.

Well-known techniques such as the Wallace tree [10] and the Dadda tree [11] have successfully employed
row compression to achieve improvements in power and speed. A new column compression
technique has been exploited in [12]. The high-speed multiplier design using counters and
compressors given in [13] offers a significant improvement in area overhead over the one implemented using
(3,2) counters. The implementation of an algorithmic Wallace tree multiplier using high-speed
counters presented in [14] proves to be a superior strategy for power-efficient high-speed
multiplication. 

A new Wallace tree multiplier design [15] that gives significant improvements in overhead can
be obtained by employing majority logic. Another version of a high-speed multiplier, which employs a CLA in
the structures of the Wallace tree and Dadda tree multipliers, is given in [16]. The detailed study of various
high-speed adders given in [17] concludes that the Dadda multiplier is faster than the Wallace tree multiplier.
The analysis of different counter-based architectures of Wallace tree multipliers presented in [18] infers that
counter-based multipliers achieve higher operating speed while providing significant optimizations in
area overhead. 

Lin [19] shows that significant improvements in speed can also be attained by
employing a stage-reduced partial product reduction network built using parallel counters and shift
compressors. Improvements in speed can also be achieved by the use of irreducible pentanomials [20],
a special case for Galois field multipliers. Architectures that aim at high speed and low power consumption
in multiplication were presented in [21], [22]. Along similar lines, 3-2 counter and 4-2 compressor
designs that are well suited to fast multiplication, and low-power 4-2 and 5-2 compressor designs, are
given in [23]-[25]. The design of a 4-2 compressor using XOR and XNOR gates is presented in [26]. The
7-2 compressor design presented in [27] gives considerable improvements in speed and power by
minimizing the delays associated with the critical path. A 1.2-ns 16x16-bit binary multiplier using high-speed
compressors is presented in [28]. 


2. METHOD 

The proposed multiplier is designed by employing the 6-3 compressor for reducing partial products.
The 6-3 compressor is designed based on the principle of stacking, and the stacked count is finally converted
into a binary count. We first discuss the counter design in section 2.1, then the process of stacking in
section 2.2, and finally, in section 2.3, the process of converting the stacked output to a binary count. 


2.1. Counter design 
Figure 1 gives the design details of the 6-3 compressor. The basic block diagram of the 6-3 compressor,
which works on the strategy of stacking, is given in Figure 1(a). The operation of the 6-3 compressor is
straightforward: among the six inputs, the first three bits (A0, A1, and A2) are given to full adder 1 (FA1)
and the remaining three bits (A3, A4, and A5) are given to full adder 2 (FA2). The sum outputs of FA1 and FA2
are then given to the half adder (HA1) to compute the final sum output S, whereas the carry outputs of FA1, FA2
and HA1 are given to full adder 3 (FA3) to obtain the final carry outputs C1 and C2.
The schematic of the 6:3 counter depicted in Figure 1(b) is implemented using the CLA concept, making
use of propagate and generate signals to speed up the addition process, which can aid in improving the
latency of the multiplier. The P (propagate) and G (generate) equations of the 6-3 counter are given in (1) and (2)
respectively. The Boolean expressions for S (sum), C1 (carry out 1), and C2 (carry out 2) are given by
(3)-(5) respectively. 


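$P_0 = A \oplus B, \quad P_1 = C \oplus D, \quad P_2 = E \oplus F$ (1)
$G_0 = A \cdot B, \quad G_1 = C \cdot D, \quad G_2 = E \cdot F$ (2)
$S = P_0 \oplus P_1 \oplus P_2$ (3)
$C_1 = (P_0 \cdot P_1 \oplus P_0 \cdot P_2 \oplus P_1 \cdot P_2) \oplus (G_0 \oplus G_1 \oplus G_2)$ (4)
$C_2 = (G_0 \cdot G_1 + G_0 \cdot G_2 + G_1 \cdot G_2) + ((P_0 \cdot P_1) \cdot G_2) + ((P_0 \cdot P_2) \cdot G_1) + ((P_1 \cdot P_2) \cdot G_0)$ (5)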
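
The propagate/generate formulation in (1)-(5) can be checked exhaustively with a short Python sketch. This is only an illustrative model: the function names are hypothetical, and it assumes that the inputs A to F in the equations are the six input bits (A0 to A5 in the text) and that C2, C1 and S carry weights 4, 2 and 1.

```python
from itertools import product


def counter_6_3_pg(a, b, c, d, e, f):
    """6-3 counter built from pairwise propagate (XOR) and generate (AND) signals."""
    p0, p1, p2 = a ^ b, c ^ d, e ^ f            # propagate signals, as in (1)
    g0, g1, g2 = a & b, c & d, e & f            # generate signals, as in (2)
    s = p0 ^ p1 ^ p2                            # sum bit, as in (3)
    c1 = ((p0 & p1) ^ (p0 & p2) ^ (p1 & p2)) ^ (g0 ^ g1 ^ g2)   # as in (4)
    c2 = ((g0 & g1) | (g0 & g2) | (g1 & g2)
          | (p0 & p1 & g2) | (p0 & p2 & g1) | (p1 & p2 & g0))   # as in (5)
    return c2, c1, s


def matches_popcount():
    """True if 4*C2 + 2*C1 + S equals the number of 1s for all 64 input patterns."""
    for bits in product((0, 1), repeat=6):
        c2, c1, s = counter_6_3_pg(*bits)
        if 4 * c2 + 2 * c1 + s != sum(bits):
            return False
    return True


if __name__ == "__main__":
    print(matches_popcount())
```

Because a pair's propagate and generate signals can never both be 1, the three-term products in the C2 expression are enough to detect two or more weight-2 bits.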




Figure 1. 6-3 compressor (a) basic principle and (b) circuit diagram 
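
The adder-based structure described for Figure 1(a) can be modelled behaviourally as follows. This is a sketch under stated assumptions: the helper names are hypothetical, and it assumes that the sum output of FA3 is C1 and its carry output is C2, which the text does not state explicitly.

```python
from itertools import product


def full_adder(x, y, z):
    """Return (sum, carry) of three input bits."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def half_adder(x, y):
    """Return (sum, carry) of two input bits."""
    return x ^ y, x & y


def counter_6_3_adders(a0, a1, a2, a3, a4, a5):
    """6-3 counter from two input full adders, one half adder and a final full adder."""
    s1, k1 = full_adder(a0, a1, a2)     # FA1 on the first three inputs
    s2, k2 = full_adder(a3, a4, a5)     # FA2 on the remaining three inputs
    s, kh = half_adder(s1, s2)          # HA1 combines the two sums into S
    c1, c2 = full_adder(k1, k2, kh)     # FA3 (assumed: sum -> C1, carry -> C2)
    return c2, c1, s


def adders_match_popcount():
    """True if the structure counts the 1s correctly for all 64 input patterns."""
    return all(
        4 * c2 + 2 * c1 + s == sum(bits)
        for bits in product((0, 1), repeat=6)
        for (c2, c1, s) in [counter_6_3_adders(*bits)]
    )


if __name__ == "__main__":
    print(adders_match_popcount())
```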


2.2. Bit stacking 

Stacking is the process of grouping all input logic 1's together. After stacking, the stacked bits
are transformed into a binary count of the 6-bit input. In the initial stage, 3-bit stacking circuits are employed
to obtain 3-bit stacks, and then the 3-bit stacks are merged to obtain a 6-bit stack. The basic stacking circuit
and the process of stacking are given in Figure 2. Figure 2(a) gives the 3-bit stacking circuit, in which P1, P2
and P3 are the inputs and Q1, Q2 and Q3 are the outputs. As we are only grouping
logic 1's together, the total number of logic 1 bits at the output is the same as the total number at the input.
Grouping the logic 1's together means placing all logic 1's to the left, followed by the logic
0's. The outputs Q1, Q2 and Q3 of the 3-bit stacking circuit are characterised by (6), (7) and (8) respectively. 


$Q_1 = P_1 + P_2 + P_3$ (6)
$Q_2 = P_1 P_2 + P_1 P_3 + P_2 P_3$ (7)
$Q_3 = P_1 P_2 P_3$ (8)
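
As an illustration, the 3-bit stacking functions (6)-(8) can be written as a small Python model and checked against all eight input patterns. The function name `stack3` is hypothetical; the inputs are taken to be P1, P2 and P3 as named in the text.

```python
from itertools import product


def stack3(p1, p2, p3):
    """3-bit stacker: same number of 1s as the input, grouped to the left."""
    q1 = p1 | p2 | p3                          # 1 if at least one input is 1
    q2 = (p1 & p2) | (p1 & p3) | (p2 & p3)     # 1 if at least two inputs are 1
    q3 = p1 & p2 & p3                          # 1 only if all three inputs are 1
    return q1, q2, q3


def stack3_is_sorted_copy():
    """True if every output is the input bits sorted with the 1s first."""
    return all(
        stack3(*bits) == tuple(sorted(bits, reverse=True))
        for bits in product((0, 1), repeat=3)
    )


if __name__ == "__main__":
    print(stack3(0, 1, 0), stack3(1, 0, 1), stack3_is_sorted_copy())
```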

From the above functions it is clear that output Q1 is logic 1 if any of the inputs is at logic 1,
output Q2 is logic 1 if at least two of the inputs are at logic 1, and output Q3 is logic 1 if and only if all of
the inputs are at logic 1. Next, the outputs of two 3-bit stackers are merged and converted to a binary count
of the 6-bit input. This merging process is illustrated in Figure 2(b). Assume there
are six inputs X1, X2, X3, ..., X6. These six inputs are divided into two groups of three bits each, and each
group is stacked by a 3-bit stacking circuit. Let X1, X2, and X3 be stacked into the signals
Y1, Y2 and Y3, and let X4, X5, and X6 be stacked into the signals Z1, Z2, and Z3. 

From the example illustrated in Figure 2(b), it can be seen that the merged vector contains a train of logic 1's
bounded by logic 0 bits. To get a proper stack, all logic 1's must be moved to the left, followed by the logic
0's. To achieve this, we employ two more 3-bit stacking circuits, which are fed by the output of the merged
6-bit stacker, i.e., Y3, Y2, Y1, Z1, Z2, and Z3. For clarity, these outputs are
represented using two 3-bit vectors, L (L1, L2, and L3) and M (M1, M2, and M3), which are
connected to the two 3-bit stacking circuits. These two 3-bit stacking circuits are combined to form another
6-bit stacking circuit. To obtain a proper stack, i.e., all logic 1's on the left followed by logic
0's, vector L is filled with logic 1's before vector M. Hence, we define the following expressions
based on the requirement of proper stack operation. 


$L_1 = Y_3 + Z_1$ (9)
$L_2 = Y_2 + Z_2$ (10)
$L_3 = Y_1 + Z_3$ (11)




Figure 2. Stacking (a) circuit and (b) processes 


In the M vector, if the total number of 1's at the input is less than or equal to three, all M bits
are logic 0. This drives several of the AND gates in the stacking circuit with logic 0 inputs, which aids
in building a power-efficient architecture. To understand the logic behind the stacking
circuit, note that L1 L2 L3 and M1 M2 M3 together contain the same number of 1's as the input, the only
difference being that the L bits are filled with logic 1's before any of the M bits. 
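
The merging step can be sketched in Python as below. The L vector follows (9)-(11). The M vector is not given explicitly in the text, so the bitwise AND of the same signal pairs used here is an assumption, chosen because it makes L fill with 1's before M. All function names are hypothetical.

```python
from itertools import product


def stack3(p1, p2, p3):
    """3-bit stacker, as in (6)-(8)."""
    return p1 | p2 | p3, (p1 & p2) | (p1 & p3) | (p2 & p3), p1 & p2 & p3


def merge_stacks(y, z):
    """Combine two 3-bit stacks into the L and M vectors."""
    y1, y2, y3 = y
    z1, z2, z3 = z
    l_vec = (y3 | z1, y2 | z2, y1 | z3)    # (9)-(11): OR of reversed Y with Z
    m_vec = (y3 & z1, y2 & z2, y1 & z3)    # assumption: AND of the same pairs
    return l_vec, m_vec


def stack6(x):
    """6-bit stacker: two input stackers, merge, then restack L and M."""
    y = stack3(*x[:3])
    z = stack3(*x[3:])
    l_vec, m_vec = merge_stacks(y, z)
    return stack3(*l_vec) + stack3(*m_vec)


def stack6_is_sorted_copy():
    """True if every 6-bit input comes out with all its 1s grouped to the left."""
    return all(
        stack6(bits) == tuple(sorted(bits, reverse=True))
        for bits in product((0, 1), repeat=6)
    )


if __name__ == "__main__":
    print(stack6((0, 1, 0, 1, 1, 0)), stack6_is_sorted_copy())
```

With this choice, M stays all zero whenever three or fewer inputs are set, which matches the behaviour described for the M vector above.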

2.3. Conversion of stacked bits to binary count 

The successful implementation of the 6-3 counter requires conversion of the stacked bits into a binary count.
The intermediate values of Y, Z and M are employed to achieve this conversion. The outputs C2, C1,
and S are the binary representation of the number of 1's present at the input of the 6-bit stacker. Thus, output
S can be determined from the parity of the outputs of the initial layer of 3-bit stackers. If the number of 1's in
X1, X2, and X3 is 0 or 2, Y has even parity. Similarly, if the number of 1's in X4,
X5, and X6 is 0 or 2, Z has even parity. Here, Ye and Ze are used to indicate even parity in Y and Z
respectively, and they are given by (12) and (13). 


$Y_e = \overline{Y_1} + Y_2 \overline{Y_3}$ (12)
$Z_e = \overline{Z_1} + Z_2 \overline{Z_3}$ (13)


The sum (S) indicates odd parity over all input bits (XOR operation); in other words, the sum of two
numbers with distinct parities is odd. Although an XOR gate is used to obtain the sum in the 6-3 compressor,
this XOR gate is not on the critical path. Thus, the sum (S) can be computed by (14). 


$S = Y_e \oplus Z_e$ (14)
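
The parity argument can be checked with a short Python sketch. It assumes the even-parity bits take the form "first stack bit is 0, or second is 1 while third is 0", which is what the counts 0 and 2 imply for a left-grouped stack; the function names are hypothetical.

```python
from itertools import product


def stack3(p1, p2, p3):
    """3-bit stacker, as in (6)-(8)."""
    return p1 | p2 | p3, (p1 & p2) | (p1 & p3) | (p2 & p3), p1 & p2 & p3


def even_parity(q1, q2, q3):
    """1 if a stacked 3-bit vector holds zero or two 1s, else 0."""
    return (1 - q1) | (q2 & (1 - q3))


def sum_bit(x):
    """Sum output S of the 6-3 counter from the two even-parity bits, as in (14)."""
    ye = even_parity(*stack3(*x[:3]))
    ze = even_parity(*stack3(*x[3:]))
    return ye ^ ze


def sum_bit_is_lsb():
    """True if S equals the least significant bit of the count for all 64 inputs."""
    return all(sum_bit(bits) == sum(bits) % 2 for bits in product((0, 1), repeat=6))


if __name__ == "__main__":
    print(sum_bit((1, 0, 0, 1, 1, 0)), sum_bit_is_lsb())
```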


Along the same lines, C1 is logic 1 for counts of 2, 3 or 6. This
gives rise to two cases. In the first case, we have to verify that at least two but not more than three inputs
are set. For this we employ the Y, Z and M vectors. To verify that at least two inputs are set, we check for a
stack of length two. This may come from either top-level stacker or from two stacks of length one, which
yields $Y_2 + Z_2 + Y_1 Z_1$. We also have to verify that no more than three inputs are set. Since the M vector
has a bit set only when more than three inputs are set, this is confirmed when all bits of M are
reset, which gives $\overline{(M_1 + M_2 + M_3)}$. In the second case,
we have to verify that all six inputs are logic 1. This can be done by checking the Y and
Z vectors. As Y and Z are bit stacks, it is sufficient to check the rightmost bit of each stack. This yields
$C_1 = (Y_2 + Z_2 + Y_1 Z_1) \overline{(M_1 + M_2 + M_3)} + Y_3 Z_3$. The computation of C2 is
straightforward, as it is set whenever at least four bits are set, which gives $C_2 = M_1 + M_2 + M_3$. 
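
Putting the pieces of this section together, the stack-based 6-3 counter can be modelled and verified exhaustively in Python. This is a sketch rather than the implemented netlist: the function names are hypothetical, and the M vector is assumed to be the bitwise AND of the reversed Y stack with the Z stack.

```python
from itertools import product


def stack3(p1, p2, p3):
    """3-bit stacker, as in (6)-(8)."""
    return p1 | p2 | p3, (p1 & p2) | (p1 & p3) | (p2 & p3), p1 & p2 & p3


def counter_6_3_stack(x):
    """Return (C2, C1, S) for six input bits using the stacking approach."""
    y1, y2, y3 = stack3(*x[:3])
    z1, z2, z3 = stack3(*x[3:])
    any_m = (y3 & z1) | (y2 & z2) | (y1 & z3)   # assumed M bits, ORed together
    ye = (1 - y1) | (y2 & (1 - y3))             # even parity of Y
    ze = (1 - z1) | (z2 & (1 - z3))             # even parity of Z
    s = ye ^ ze                                 # odd parity of all six inputs
    at_least_two = y2 | z2 | (y1 & z1)          # a stack of length two exists
    c1 = (at_least_two & (1 - any_m)) | (y3 & z3)   # count is 2, 3 or 6
    c2 = any_m                                  # count is at least 4
    return c2, c1, s


def stack_counter_matches_popcount():
    """True if 4*C2 + 2*C1 + S equals the number of 1s for all 64 input patterns."""
    return all(
        4 * c2 + 2 * c1 + s == sum(bits)
        for bits in product((0, 1), repeat=6)
        for (c2, c1, s) in [counter_6_3_stack(bits)]
    )


if __name__ == "__main__":
    print(counter_6_3_stack((1, 1, 0, 1, 0, 1)), stack_counter_matches_popcount())
```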


3. RESULTS AND DISCUSSION 

This section evaluates the proposed architectures, namely the 6-3 compressor and the approximate multiplier
designed with the aid of the 6-3 compressor. Functional verification is carried out using ModelSim, and
implementation is carried out in Xilinx to extract features such as power, area and speed. Here, power is
expressed in mW, area in terms of LUTs, and speed (delay) in ns. 


3.1. Power consumption 

The power consumed by the architectures considered in the experiments is
given in Table 1. The architecture column lists the multiplier architectures of
interest. The static power and dynamic power columns give the static and
dynamic power consumed by the corresponding architectures. Finally, the total power column gives the
total power consumed by an architecture, which is almost equal to the sum of its static and dynamic power.
The total power consumed by the proposed architecture is 155.43 mW, which
includes a static power consumption of 33.6 mW and a dynamic power consumption of 121.83 mW. 


Table 1. Comparison of power consumption of proposed multiplier 


Architecture                                                      Static power (mW)  Dynamic power (mW)  Total power (mW)
Wallace tree multiplier in [12]                                   33.6               151.74              185.34
Multiplier architecture by employing 4-2 and 5-2 compressor [18]  33.6               138.96              172.58
Binary multiplier based on stacking [6]                           33.6               136.82              170.42
Proposed approximated binary multiplier                           33.6               121.83              155.43
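
As a quick arithmetic check on Table 1, the following Python sketch compares each reported total with the sum of the static and dynamic values. The numbers are copied from the table; the dictionary keys and the function name `power_gap` are hypothetical labels used only here.

```python
# (static, dynamic, reported total) in mW, copied from Table 1
TABLE_1 = {
    "wallace_tree_12": (33.6, 151.74, 185.34),
    "compressor_4_2_5_2_18": (33.6, 138.96, 172.58),
    "stacking_6": (33.6, 136.82, 170.42),
    "proposed": (33.6, 121.83, 155.43),
}


def power_gap(static, dynamic, reported_total):
    """Difference between static + dynamic and the reported total, in mW."""
    return round(static + dynamic - reported_total, 2)


if __name__ == "__main__":
    for name, row in TABLE_1.items():
        print(name, power_gap(*row))
```

Any small non-zero gap is consistent with the total being described as almost equal to the sum of the two components.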


3.2. Area overhead 

The area overhead of the architectures considered in the experiments is given
in Table 2. The architecture column lists the multiplier architectures that
are considered in the experimental work. The columns for the total number of 4-input lookup tables (LUTs)
and the number of slices give the number of 4-input LUTs and slices employed in the
architecture design. Finally, the area overhead column gives the equivalent gate
count of the LUTs and slices required to implement the given design. The area overhead
of the proposed multiplier in terms of gate equivalents is 1,146. 


Table 2. Comparison of area overhead of proposed multiplier 


Architecture                                                      Total no. of LUTs  Number of slices  Area overhead (GC)
Wallace tree multiplier in [12]                                   100                88                750
Multiplier architecture by employing 4-2 and 5-2 compressor [18]  163                101               1,042
Binary multiplier based on stacking [6]                           199                107               1,212
Proposed approximated binary multiplier                           188                100               1,146


3.3. Delay 

Along the same lines, the delay of the architectures taken for comparison is
tabulated in Table 3, where the delay column gives the delay of the
corresponding architecture in nanoseconds (ns). The values show that the proposed
architecture is faster than the Wallace tree multiplier [12] and the stacking-based multiplier [6], while the
4-2 and 5-2 compressor design [18] has a slightly lower delay. The design summary of the proposed architecture
and all the other architectures considered in the experiments is given in Table 4 in terms of the
performance parameters of power, area overhead and delay. The total power
consumed by the proposed architecture is 155.43 mW, which includes a static power consumption of 33.6 mW
and a dynamic power consumption of 121.83 mW. The area overhead of the proposed
multiplier in terms of gate equivalents is 1,146, which includes 188 4-input LUTs and 100 slices. The total
delay incurred by the proposed architecture is 32.902 ns. 


Table 3. Delay comparison of proposed multiplier 


Architecture                                                      Delay (ns)
Wallace tree multiplier in [12]                                   37.333
Multiplier architecture by employing 4-2 and 5-2 compressor [18]  31.494
Binary multiplier based on stacking [6]                           32.974
Proposed approximated binary multiplier                           32.902


Table 4. Design parameters summary of various architectures 


Architecture                                                      Area (GC)  Power (mW)  Delay (ns)
Wallace tree multiplier in [12]                                   750        185.34      37.333
Multiplier architecture by employing 4-2 and 5-2 compressor [18]  1,042      172.58      31.494
Binary multiplier based on stacking [6]                           1,212      170.42      32.974
Proposed approximated binary multiplier                           1,146      155.43      32.902
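
The relative differences implied by Table 4 can be computed with a short Python sketch. The values are copied from the table; the dictionary keys and the function name `percent_reduction` are hypothetical, and a negative result means the proposed design is worse on that metric than the reference.

```python
# (area in gate count, power in mW, delay in ns), copied from Table 4
TABLE_4 = {
    "wallace_tree_12": (750, 185.34, 37.333),
    "compressor_4_2_5_2_18": (1042, 172.58, 31.494),
    "stacking_6": (1212, 170.42, 32.974),
}
PROPOSED = (1146, 155.43, 32.902)


def percent_reduction(reference, proposed):
    """Reduction of the proposed value relative to the reference, in percent."""
    return 100.0 * (reference - proposed) / reference


if __name__ == "__main__":
    for name, row in TABLE_4.items():
        area, power, delay = (
            round(percent_reduction(ref, new), 2) for ref, new in zip(row, PROPOSED)
        )
        print(f"{name}: area {area}%, power {power}%, delay {delay}%")
```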


Figure 3 shows the simulation parameters for all the architectures
considered in the experiments. Figure 3(a) shows the static, dynamic and total
power consumed by the different architectures. The blue curve
shows the static power dissipation incurred by the architectures, and the red and green curves
show the dynamic and total power consumption respectively. Figure 3(b)
shows the hardware requirements of the architectures in terms of LUTs, slices
and gate count. The red and blue curves show the number of LUTs and slices
used in the design of the different architectures, and the green curve shows the total equivalent
gate count of the LUTs and slices. From these curves, it can be observed that the proposed
architecture has lower hardware requirements than that of [6] and slightly higher hardware requirements
than [12] and [18], which suffer from increased power consumption. The red
graph in Figure 3(c) indicates the delay incurred by the different architectures
considered for comparison. It shows that the proposed architecture attains a lower delay than [12] and [6]
and a slightly higher delay than [18]. Figure 3(d) gives a summary of the
implemented architecture in terms of power consumption, area overhead and delay. 

Figure 3. Simulation parameters (a) power consumption, (b) area overhead, (c) delay, and (d) savings attained
in power consumption, area overhead and delay 


4. CONCLUSION 

Here we have presented a new compressor-based multiplier that eliminates the
XOR gates associated with the critical path, which in turn results in speed improvements. The experimental
results show that the proposed multiplier achieves up to 12% improvement in speed
compared with existing architectures. The proposed architecture also achieves power
improvements of 20% compared with the Wallace tree multiplier and power savings of 10% compared with the
5-2 and 4-2 multiplier architecture. Further savings in power and increases in multiplier speed can be obtained
by employing 7-2 compressors that operate at ultra-low power and by applying low-power design
concepts to the 7-2 compressors and adders that are employed for compression. 